Editorial: Use consistent notation for code points.#1724
Conversation
| </li> | ||
| <li> | ||
| It is a Syntax Error if FlagText of |RegularExpressionLiteral| contains any code points other than `"g"`, `"i"`, `"m"`, `"s"`, `"u"`, or `"y"`, or if it contains the same code point more than once. | ||
| It is a Syntax Error if FlagText of |RegularExpressionLiteral| contains any code points other than `g`, `i`, `m`, `s`, `u`, or `y`, or if it contains the same code point more than once. |
There was a problem hiding this comment.
These are code points, not code units. (So the status quo is kind of wrong as well, but this PR probably shouldn't propagate that.)
There was a problem hiding this comment.
Agh, right, I am conflating the two (though in my defense, it's not just me—compare lines 25051 and 25076 here 😅). Is your vote to just worry about the code point cases and let (sequences of) code units be represented as strings?
| <h1>Pattern Semantics</h1> | ||
| <p>A regular expression pattern is converted into an internal procedure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The internal procedure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.</p> | ||
| <p>A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `"u"`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (<emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>). In either context, “character value” means the numeric value of the corresponding non-encoded code point.</p> | ||
| <p>A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `u`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (<emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>). In either context, “character value” means the numeric value of the corresponding non-encoded code point.</p> |
There was a problem hiding this comment.
This is (arguably) a code point, not a code unit.
|
To denote a code unit, we currently have a long way (e.g. This PR proposes a different short notation, i.e. the current notation minus the quotes. But that is already the notation we use for terminal symbols and ES code snippets, i.e. source text, i.e. Unicode code points. I don't think it would be a good idea to use the same notation for code units and code points. Instead, I think it would be okay for these cases to just change the backticks to asterisks as part of the one fell swoop, with the possible exception of the two cases noted. |
| </emu-alg> | ||
| <emu-note> | ||
| <p>The code point `"#"` is not decoded from escape sequences even though it is not a reserved URI code point.</p> | ||
| <p>The code point `#` is not decoded from escape sequences even though it is not a reserved URI code point.</p> |
There was a problem hiding this comment.
Seems like we should update code point → code unit on this line for consistency with surroundings (and justification for dropping the proposed change here).
There was a problem hiding this comment.
It's a bit tricky. Although the input and output of Decode are just String values, they are different ways of encoding a ("true") URI consisting of Unicode code points. So it's somewhat ambiguous whether "decoded" in this sentence is talking about the whole effect of Decode, or a particular point within it at which an escape sequence could be (but isn't) decoded into a Unicode code point.
Given the placement of the note, it seems more plausible that it would be talking about the whole effect of Decode, rather than an internal detail, in which case changing the first occurrence of "code point" to "code unit" (as you suggest) would be appropriate. (Though rewording the sentence to not have that ambiguity would be a bigger improvement, I think.)
There was a problem hiding this comment.
Alternatively, maybe it's the parallel sentence below that should change in the opposite direction? Re-reading the sentence as a whole, it seems we could argue that the note is purely about the code point #, whereas the algorithm step is building a string from code units including "#".
There was a problem hiding this comment.
Alternatively, maybe it's the parallel sentence below that should change in the opposite direction?
That is, on line 25076, change "code unit" to "code point".
Re-reading the sentence as a whole, it seems we could argue that the note is purely about the code point
#,
Yup, it's debatable.
whereas the algorithm step is building a string from code units including "#".
The algorithm is clearly building a string, although the phrase each code unit valid in |uriReserved| is slightly problematic, if we assume (as with the spec's other lexical grammars) that the terminal symbols are code points, not code units. In effect, the 'uri' grammar is not there to specify a syntax (nothing is parsed with |uri| as the goal symbol), but merely to define |uriReserved| and |uriUnescaped| as sets of code units. But that's mostly beside the point.
|
The net effect of these two commits looks good to me. |
jmdyck
left a comment
There was a problem hiding this comment.
(Okay, this time as a review:) Looks good to me.
I'd recommend squashing the two commits to reduce churn, since the second undoes most of the first.
d149859 to
93635e4
Compare
|
Added feedback from #1733 here. |
|
Yup, looks good. |
93635e4 to
ede7059
Compare
| <emu-alg> | ||
| 1. Let _s_ be SourceText of |LoneUnicodePropertyNameOrValue|. | ||
| 1. If ! UnicodeMatchPropertyValue(`"General_Category"`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, then | ||
| 1. If ! UnicodeMatchPropertyValue(`General_Category`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of <emu-xref href="#table-unicode-general-category-values"></emu-xref>, then |
ede7059 to
cf8607e
Compare
|
(Note that the PR was merged with the original title involving "code units" rather than the revised title involving "code points", which might be a source of confusion.) |
|
Oops, thanks. Hopefully the linked PR will resolve any potential future confusion. |
Re-evaluating all cases of
`"foo"`throughout the spec (so as to switch to*"foo"*in one fell swoop) reveals that our notation for codeunitspoints isn't entirely consistent.I'm starting with the maximal patch here for discussion's sake, but I'm happy to drop any (or all) of these changes; my only goal is to avoid blindly upgrading to asterisks without adequate consideration.
(CC: @jmdyck)